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Abstract 

Respondent-driven sampling (RDS) is a commonly used substitute for random sampling when studying hidden pop- 
ulations, such as injecting drug users or men who have sex with men, for which no sampling frame is known. The 
method is an extension of the snowball sample method and can, given that some assumptions are met, generate unbi- 
ased population estimates. One key assumption, not likely to be met, is that the acquaintance network in which the 
recruitment process takes place is undirected, meaning that all recruiters should have the potential to be recruited by 
the person they recruit. Here we investigate the potential bias of directedness by simulating RDS on real and artificial 
network structures. We show that directedness is likely to generate bias that cannot be compensated for unless the 
sampled individuals know how many that potentially may have recruited them (i.e. their indegree), which is unlikely 
in most situations. Based on one known parameter, we propose an estimator for RDS on directed networks when only 
outdegrees are observed. 

By comparison of current RDS estimators' performances on networks with varying structures, we find that our 
new estimator, together with a recent estimator, which requires the population size as a known quantity, have relatively 
low level of estimate error and bias. Based on our new estimator, sensitivity analysis can be made by varying values of 
the known parameter to take uncertainty of network directedness and error in reporting degrees into account. Finally, 
we have developed a bootstrap procedure for the new estimator to construct confidence intervals. 
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1. Introduction 

Hidden populations (hard-to-reach populations), such as injecting drug users (IDU), men who have sex with men 
(MSM), and sex workers (SW) and their sexual partners, are generally considered as critical actors in the HIV epi- 
demic [1, 2, 3]. Consequently, obtaining population characteristics and risk behaviors of these populations are critical 
for developing efficient disease control strategies. However, the lack of sampling frames for such populations makes 
traditional estimation methods based on random samples practically useless. Other methods have been proposed for 
such situations, for example key informant sampling [4], targeted/location sampling [5] and snowball sampling [6]. 

A more recent method is Respondent Driven Sampling (RDS), which was proposed to overcome difficulties when 
sampling hidden populations [7, 8, 9]. The RDS method starts with an initial selection of respondents, which are 
called "seeds". Each seed is given a number of "coupons" - tickets for participation in the study - to distribute to 
friends and acquaintances within the population of interest. When interviewed (anonymously), a new respondent is 
in turn given coupons to distribute. Everyone is rewarded both for completing the interview, and for recruiting their 
peers into the study. If the recruitment chains are sufficiently long, the sample composition will stabilize and become 
independent of the seeds. Additionally, information about who recruits whom and each respondent's personal network 
size (degree) are recorded. 
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There are two significant improvements in RDS compared to other non-random methods that has been used when 
sampling hidden population: Firstly, it uses dual incentives to encourage respondents to recruit more peers into the 
study. Secondly, when several assumptions are fulfilled, unbiased estimates of population proportions can be obtained 
by utilizing the recruitment and degree information collected in the sample. 

Suppose a RDS study is performed on a connected undirected network with the additional assumptions that: 

(i) sampling of peer recruitment is done with replacement; 

(ii) each participant recruits one new participant to the study; and 

(iii) participants recruit randomly from their neighbors. 

Then, the sampling probability of an individual i will be proportional to its degree when the sample reaches equilib- 
rium. The population fraction pa having a certain property A (e.g. pa could denote the fraction among intravenous 
drug-users that are HIV-positive) can then be estimated by the weighted proportion of the sample fraction as in Volz 
and Heckathorn [9]: 
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where the sample population U has been divided into two disjoint subsets A and B - A c depending on the reported 
properties of respondents, and d\ denotes the degree of individual i in the sample. 

Based on equating the number of crossrelations between subgroups of property A and B, Salganik and Heckathorn [10] 
proposed an another widely used estimator for pa: 

Pa = 2 5~. C 2 ) 
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where Da = ^ A d -i and Dg = are the estimated harmonic mean degrees for the two subgroups, (yia and n# 

denote the number of A- and B-individuals in the sample respectively), and $ba denotes the sample fraction of all B- 
respondents naming A-peers and similarly sab denotes the sample fraction of all A-respondents naming Z?-peers. For 
simplicity, we henceforth refer to (1) and (2) the VH out and SH out estimators respectively; the subscript out indicates 
that respondents out-degrees have been recorded, which will be important when we move to directed networks later 
on. 

The ability to produce unbiased population estimates and a feasible field implementation have contributed to a 
rapid increase in RDS studies conducted globally in recent years [2, 11]. There has also been an increase in studies 
evaluating the performance of RDS estimators as well as in developing new estimators [12, 13, 14, 15]. 

Previous studies are mostly based on the assumption that relationships are reciprocal and the probability of being 
recruited by a peer is likewise equal on both ends of a relationship. Consequently, the network among which recruit- 
ments could take place is undirected. However, it is well-known that social networks, such as friendship networks, 
are generally directed to various extents. For example, in the study of Scott and Dana [16], only 6,669 out of 12,931 
"best friend" nominations were found to be reciprocal, and in the study conducted by Wallace [17, 18], an average 
of 55.0 reciprocal nominations per respondent were found while the mean degree was 94.8. Evidence of irreciprocal 
recruiter-recruitee relationships has also been found in many RDS studies, e.g., in a RDS study of IDUs in Sydney, 
Australia [19], 29% of the respondents consider the relationship to their recruiter to be "not very close", and in a 
study of IDUs in Tijuana, Mexico [20], only 62% of the respondents consider their relationship with their recruiter as 
"friend". Additionally, in a study for MSM in Beijing, China [21], 8.5% participants said they received their coupons 
from a stranger, and between 3% to 7% of recruitments were found to be from strangers in the RDS studies on drug 
users and MSMs in three US cities and in St. Petersburg, Russia [22]. 

In [23], it has been shown that current RDS estimators may generate relatively large biases and errors if the studied 
networks are directed, indicating that estimates from previous RDS studies should be interpreted and generalized with 
caution. This study aims to further evaluate the influence of structural network properties on the performance of 
RDS estimators under the assumption that the underlying social network is (partially) directed, and to derive new 
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estimators allowing networks to be directed. In our evaluation, we use simulated data, as well as a real online MSM 
social network, to generate networks with varying directedness, degree correlation, indegree-outdegree correlation 
and homophily. Based on one known parameter, we propose a new estimator for RDS on directed networks and 
show how sensitivity analysis can be used to generate estimate intervals induced by intervals of unobserved network 
properties. Additionally, we develop a bootstrap procedure for the new estimator to construct confidence intervals. 



2. RDS estimation on directed networks 

We now investigate the properties of the RDS process on a directed network. For the purpose of this study, we 
focus on the problem of estimating the community fraction p& having a certain dichotomous property A. Let G denote 
a (partially) directed network and let ey = 1 if there is a directed edge from i to j and ey = otherwise. A reciprocal 
edge between i and j is hence reflected by ey = e» = 1. We assume that G is strongly connected, i.e., there is a 
directed path between any pair of nodes - otherwise we of course would not be able to to estimate p A well since it 
may then be impossible to reach certain parts of the community with RDS. Finally, we let yV denote the community 
size, most often an unknown quantity in hidden or hard-to-reach populations. In what follows, assumptions (i)-(iii) 
are assumed to be fulfilled in the RDS process. 

2.1. Extension ofVH ou , estimator to directed networks 

When a RDS process takes place on a strongly connected network G, the recruitment of new respondents are 
dependent only on the current respondent, since he will select a new respondent uniformly from his peers. Thus, 
RDS possesses the Markov property [24] and can be modeled as a Markov process with transition matrix R = {ay = 
ejj/d" 1 ", 1 ^ i, j ^ AO, where d°"' is the outdegree of node i [9, 23]. This process has a unique equilibrium distribution 
7r = [m ■ ■ •tin] satisfying R T n T = n T , indicating that n is the eigenvector corresponding to eigenvalue 1 for R T . 
Consequently, n, can be used to obtain the Hansen-Hurwitz estimator where observations are weighted by the inverse 
of the sampling probability [23]: 
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It has been shown that when the network is undirected, n, = d\l 2 dj is the analytical solution for n and p A can 

7=1 

be estimated by the VH 0Ut estimator: p™ ,m ' = £ dr l Z df l . 
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Unfortunately, no analytical solution for ji is available for a general directed network. However, note that under 
the above assumptions, the RDS process is merely a random walk on the network, for which we can easily adopt the 
mean field approach in [25] to derive an approximation of tt: 

Let K = (K,„,K oi „) be the set of nodes in the network with indegree K,„ and outdegree K„ ut , and let /k be the 
proportion of K-nodes in the set; then, the average inclusion probability of nodes in K is 



Using that we have a random walk assumed to be in equilibrium, and taking the average over all nodes of degree 
K, we get 

where k oi „(j) is the outdegree of node j. 

Then, the sum over j is parted into two, one over the degree classes K' and the other over the nodes within each 
degree class K'. We substitute nj with the mean value within its degree class K', yielding 

Nf K V K '<»' U U NfK * K <"" 
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where £k'^k is the total number of edges pointing from nodes of degree K' to nodes of degree K, which we can write 
as £k'^k = K/k/kA ^'^ = K,„/kA^/k'|k, where / K '|k is the proportion of edges pointing to nodes in K originating in 
K'. 

We finally obtain 

TT(K) = K,„ V ^TT(K'). (7) 

^ out 

Particularly, when there is no degree dependence in the network, (7) becomes 

ZK' „ t /K'/K,„ 1 K,„ 

K , m) = Nir> (8) 
K , ^ out " K In 

where K,„ is the average indegree in the network, implying that for networks with no degree-degree correlations, 
the RDS sample can be weighted by respondents' indegrees to estimate population proportions, which gives us the 
modified VH out estimator: 

pVHm = «VnA (9) 

Clearly, the difference between VH oul and VH m estimators is the substitution of outdegree with indegree. Thus, the 
use of VHj„ brings new challenges in the implementation of RDS as it requires collection of respondents' indegrees, 
which are not known from the RDS sample. 

2.2. Extension of SH 0Ut estimator to directed networks 

The SH 0Ut estimator was developed based on the fact that in any undirected network, the number of crossgroup 
edges pointing from A to B, equals the number of edges pointing from B to A. Similarly, in a directed network, the 
sum of nodes' indegrees in a group equals the total number of edges pointing to nodes in that group, i.e., if we let 

Saa Sab 
S ba S BB 

in group A which end in group B, then we have 
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be the recruitment matrix in the network, where, e.g., S ab is the proportion of edges originating 
; have 

I NaD° a u 'Saa + N b D b "'S B a = N A D% 
{N a D a "'Sab + N B D B "'S BB = N B D™' ' 
where, e.g., D™" is the average outdegree in group A. 

For simplicity, let m* — 2^ and w* = be the average indegree and outdegree ratio of the two groups of nodes 

in the network, and let (p — be the relative group size proportion. Dividing the above equations (10) gives a solution 
of<*: 



w*S A a -m*S BB S BA m*S BB -w*SAA^ 

<P — Al ( ) ■ (11) 
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Then, if we can correctly estimate m*, w* and S , we obtain a generalization of the SH ml , estimator: 

P S A H " = (12) 
1+0 

in which we replace unknown population quantities in by their estimates from the RDS sample. 

From the previous section, the average indegree ratio m* in S H m can be estimated by the harmonic mean ratio 

<W E (d'p' 

of indegrees from the sample for networks with no degree correlation: m* = — fS ^ — . It is however generally 
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not possible to consistently estimate w* and S using only the average outdegree and observed recruitment matrix. 
The sample mean outdegree will be an unbiased estimator only if there is no correlation between the indegree and 
outdegree of nodes, while the harmonic mean of outdegree is expected to have higher precision if the indegree- 
outdegree correlation is high. However, in simulations it is seen that there is little difference in using either the 
(arithmetic) mean or harmonic mean of outdegree to estimate w* and thereby we continue to use the harmonic mean 
in the following analysis. We have also tried to adjust potential bias in the estimation of S by replacing individual 
inclusion probabilities with group inclusion probabilities (see Supportive Information (SI) for details), which however 
didn't improve the results and we therefore prefer to use the observed recruitment matrix from the sample to estimate 
S in SH in . 

The factor w* was named the activity ratio in [13], since it quantifies how active nodes in different groups are in 
building their personal networks. Following this, we henceforth refer to m* as the attractivity ratio, as it reflects how 
"attractive" nodes in different groups are, or to which group of nodes edges are inclined to connect to. 

2.3. Sensitivity analysis when indegree is not known 

Hardly ever is the indegree observed in RDS studies. Consequently, the use of VH m and SH m is limited in practice. 
It is however possible to use both estimators if prior information is available. In S Hj„, if the indegree is not known, 
the estimate of average indegree ratio, m*, becomes an unknown parameter in (12). This is true also for VH in , since 
we can rewrite (9) as: 
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Replacing D'^/D'g with m, we have: 



*VH m n A ln B 

Pa = — ; — — ■ ( 13 ) 
n A /n B + m 

Prior information may, for example, be obtained by expert opinion, or by using previous empirical results. What's 
more, even if there is little prior knowledge about the targeted population, we can, instead of providing a point estimate 
with fixed parameters, use a range of m values to generate an estimate interval for p* A . That is, if m* is assumed to lie 
within a certain range, [/«,„,-„, m max ], we get an interval of p A , [p A (m m i n ), p A (m max )], by varying m in (12). 

Following this, we will perform a sensitivity analysis on SH m , and VH m , by varying the ratio of average indegrees 
m to get an interval of the estimates of p A , [p A (m m i n ), p A (m max )], with m lying in a certain range, [m m i n ,m max \. By 
choosing an interval centered on a value of m based on prior information, we will get intervals of possible p A val- 
ues which more fully accounts for the situation when the network is directed, and provides valuable results on the 
sensitivity of estimators to the correctness of indegree assumptions about the network. 



3. Network Data and Study Design 

We will evaluate the performance of our suggested estimators and compare them with existing estimators through 
simulations of RDS processes on directed networks. The simulations will be performed on both artificially generated 
families of directed networks as well as a real MSM online social network [23], which makes it possible to study the 
impact of different, carefully controlled, structural network properties on our estimators as well as looking at their 
behavior in a more realistic setting using actual data. 

In our evaluation, we will consider the following parameters which are important both to directed networks and 
RDS estimation: 

Directedness; if E dir is the number of directed edges in a network with E edges, then the proportion of directed 
edges is: 

A = E dir /E, (14) 

i.e., A = when the network is undirected, and A = 1 when the network is (extremely) directed in a way such that 
there are no reciprocal edges. 
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Indegree correlation; the tendency that nodes with high indegrees are connected with each other. To quantify this, 
we use the assortativity ratio defined in [26]: 

E' 1 E« jih - [E' 1 Zi \Ui + k)f 
7= n, (15) 

where j, and k, are the indegrees of vertices at the end of the i th edge, i = l,..,,E. 

Indegree-outdegree correlation; unlike the indegree correlation, which describes associations between nodes, the 
indegree-outdegree correlation measures the correlation between indegree and outdegree for the same node. We use 
the Pearson correlation calculated from all nodes in the network: 

p = Cov(<f ", d°"')/o>,o>,». (16) 

Homophily; the probability that nodes connect with neighbors that are similar to themselves with respect to the 
studied feature A rather than that they connect randomly [27, 28, 29, 8]. Letting Iia be the homophily for nodes with 
trait A, it holds that Saa — ka + (1 - Iia)pa, implying that Iia can be calculated as: 



h A = 1 - SabIpb. (17) 

The activity ratio w*, as well as the attractivity ratio m*, are also used as network structure parameters in our 
assessment. 



3.1. Network Data 

We will focus our study of network properties on directedness and attractivity ratio, as they are of general interest 
to the study of directed networks, and of particular interest to our estimators. Furthermore, we will vary other network 
structural properties that are likely to affect our estimators. For example, the VHj„ estimator is based on the assumption 
of no indegree correlation in the network, and the estimate of average outdegree in SH jn is based on the assumption 
of positive indegree-outdegree correlation, etc. 

In order to study the behavior of our estimators with respect to variation in directedness and attractivity ratio, we 
will use two families of generated networks, Netl, where there is little or no indegree correlation and no indegree- 
outdegree correlation, and Net2, which have varying homophily and positive indegree-outdegree correlation. This 
setting makes it possible to see how different structural properties, e.g. homophily, will affect our estimators as 
directedness and attractivity ratio are varied. 

Netl is generated starting from a random pure directed network, in which indegree and outdegree are uncorrelated 
(p ~ 0). Then, the irreciprocal edges are rewired in a particular way that doesn't change nodes' degree in order 
to generate networks with different levels of directedness (down to A — 0.2) while the indegree-outdegree correla- 
tion remains unchanged. Finally, nodes are assigned either property A or B to achieve different attractivity ratios 
m* € [0.7, 1.4] (see Table 1). The generating process for Net2 starts with a random undirected network. To obtain 
directedness, reciprocal edges are randomly rewired in such a way, that for any network in Net2 with directedness A, 
the indegree-outdegree correlation is p ss 1 - A. Then, different attractivity ratios are generated as for Netl, and we 
further rewire edges with respect to nodes' properties in order to achieve different levels of homophily: Ka g [0,0.5] 
(see Table 1). As we in this study are mostly interested in the case when sample size is relatively small compared to 
the population size, these networks are both of size 10,000. 

Table 1: Basic statistics of Netl, Net2, Net3 and the MSM network 





Network 


Average 


Directed- 


indegree corre- 


indegree-outdegree 




Homophily 


Attractivity 






size (N) 


degree (D) 


ness (A) 


lation (y) 


correlation (p) 




(h) 


ratio (m*) 


P 


Netl 


10,000 


10 


[0,1] 


[-0.09,0.01] 


S3 




[-0.30, 0.22] 


[0.7,1.4] 


70% 


Net2 


10,000 


10 


[0,1] 


[-0.03,0.14] 






[0, 0.5] 


[0.7,1.4] 


30% 














age 


0.23 


0.95 


77% 


MSM 


16,082 


17.2 


0.61 


0.03 


0.39 


ct 


0.50 


1.32 


39% 


Network 












cs 


0.03 


0.96 


40% 














pf 


0.06 


1.05 


38% 


Net3 






[0.61,0.91] 


[0, 0.4] 













* Same as the MSM network 
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The anonymized online social MSM network used in this study (previously analyzed in [30, 23]) has 16,082 nodes 
and comes from the Nordic region's largest and most active web community for homosexual, bisexual, transgender 
and queer persons (www.qruiser.com) and includes information on the relationships between members as well as 
members' personal information. Contacts between members on the web site are maintained by a "favorites list", on 
which each member can add any other member without approval from that member, so that the resulting social network 
will be directed. From this network, we obtain the giant strongly connected component of the friendship network of 
those members who identify themselves as homosexual males. Utilizing information from their user profiles, we can 
evaluate our estimators on different personal characteristics, and we will focus on four dichotomous properties: age 
(born before 1980), county (live in Stockholm, ct), civil status (married, cs), and profession (employed, pf). The 
proportions of nodes having a specific value of these properties are listed in Table 1 . 

While this network provides an opportunity to study our estimators in a more realistic setting, it and the studied 
personal properties will obviously have certain structural properties. In order to keep this level of realism, while still 
varying some structural network properties, based on the MSM network, we generate a family of networks, Net3, 
which have different levels of indegree correlation (y e [0, 0.4]). Detailed information on the generation process of 
Net3 and the other networks can be found in the supportive information SI. 

3.2. Simulation Design 

RDS processes are then simulated on the above networks and estimates of RDS estimators are compared with 
true population properties. In each simulation, seeds are uniformly selected and coupons are randomly distributed 
to the recruiters' neighbors. To simulate RDS in real practice, we let the number of seeds be 10 and the number of 
distributed coupons be 3 when shorter sample waves are desirable, and, 6 and 2 for longer sample waves (provided 
in the supporting information). Sampling is done without replacement and we choose sample size 500 for Netl and 
Net2, and 1000 for the MSM network and Net3. All simulations are repeated 1000 times. 

For each simulation, we estimate the population proportion with our suggested estimators as well as existing 
estimators. Then, the root mean square error (RMSE), standard deviation (SD) and bias of estimators are calculated 
in order to quantify the results. The estimators are divided into four categories: 

(i) The naive estimator: The raw sample composition; 

(ii) Outdegree-based estimators: SH out and VH 0Ut ; 

(iii) Indegree-based estimators: SHj„ and VHi„; 

(iv) Estimators based on known parameters: S H,„- and VH m *. 

Note that the indegree-based estimators are practically useless, since individual indegrees are not known from the 
RDS sample, and are therefore presented merely for comparison and theoretical purposes. 

Additionally, we include the estimator S S (Succesive Sampling) suggested in [15] for reference; note that this 
estimator is based on a different estimation procedure and requires knowledge of the population size (N) in order to 
yield correct estimates. Since the SS estimator is developed for RDS on undirected networks, two versions of it will 
be used in order to adapt it for use with directed networks: SS out using outdegrees of respondents in the sample in the 
estimation procedure and SS i„ using their indegrees. In the estimation procedure of SS out and SS i„, M = 500 times 
successive sampling samples per each of r — 3 iterations are used. 

4. Results 

4.1. Estimation performance on networks with known properties 
4.1.1. Networks with varying structural properties 

We start by looking at how varying directedness and attractivity ratio affects RDS estimators in an otherwise 
uncomplicated setting, i.e., the generated networks with close to zero indegree correlation and no indegree-outdegree 
correlation, Netl. In Figure 1, the bias, SD and RMSE of the raw sample composition, VH 0Ut and VH m > are shown 
in the top row (S H\„, S H m -, and VH m perform very similar to VH m * and are thus left out), and the same figures for 
VH m ', SS out , and SS ,„ are shown in the bottom row for visual clarity. 

We can see in the top row that both the raw sample composition and VH 0U , are biased with increasing \m* - 1|, 
and that VH out has the same level of bias and RMSE as the sample composition as long as the network is directed, 
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Figure 1: Bias, Standard Deviation, and Root Mean Square Error of RDS estimators on Netl. Sampling without replacement, number of seeds=10, 
coupons=3, sample size=500. 



i.e., A > 0. On the other side, the indegree-based estimator, VH m *, generates negligible bias and consistently smaller 
RMSE. 

In the bottom figures, we see that SS/„ and SS 0Ut have varying bias (smaller than VH 0UI ) as directedness and 
attractivity ratio changes, while retaining a substantially lower SD; on the other hand, VH m * has smaller bias but 
larger SD. Consequently, the RMSE of SS ,„ and SS out becomes similar with that of VH nt due to their small SD; 
sometimes, the RMSE of SS ,„ and SS ou , is even smaller than that of VH m -. 

The results in Figure 2 for RDS on networks with varying indegree-outdegree correlation, but no homophily, Net2, 
are similar to those seen in Figure 1, except that the bias and RMSE of VH out now increase gradually with the direct- 
edness of the network, generating bias and RMSE smaller than the raw sample composition, but larger than VH m . . 
This difference in performance of VH oll , on Netl and Net2 shows that for RDS on network with indegree-outdegree 
correlation, the traditional outdegree-based estimators, can still be expected to give less estimate bias and error than 
the raw sample composition. However, the indegree-outdegree correlation have little effect on the performance of 
estimators utilizing known parameters, i.e., SS SS out , and VH„ t . 

In Figure 3, where homophily h = 0.4, we see that the magnitude of bias, SD and RMSE all increase for the raw 
sample composition, VH out and VH m . , indicating a clear effect of homophily on increasing RDS estimate bias and 
error. However, on the other hand, SS ,„ and SS 0Ut are quit robust to the effect of homophily, their SD ([0.008, 0.01]) 
remains substantially smaller than the rest estimators ([0.02, 0.04]), and in most cases they produce minimum RMSE. 

Overall, it is clear that previous RDS estimators are seriously affected by letting RDS processes take place on di- 
rected networks, and that our suggested estimators and the SS estimators, although all relying on previous knowledge 
about the network, shows major improvements in the quality of estimates. 

4.7.2. The MSM network and its modifications 

In the MSM network we look at four dichotomous user properties and how the estimators behave on each of 
them. As the structural properties of the MSM network are fixed, we illustrate estimator behavior through box plots, 
which are shown in the left panel of Figure 4. In each box, the central line is the median, the dot is the mean, the 
edges of the box are the 25th {qi) and 75th (^3) percentiles. Estimates being at least 1.5(^3 - q{) away from the edges 
of the box are shown as outliers beyond the whiskers. 

The traditional outdegree-based estimators, SH oul and VH out , have large bias when estimating variables with large 
homophily and attractivity ratios which significantly differ from 1, i.e., age and county. For example, their estimates 
of the proportion of MSM members who live in Stockholm are on average over 5 percentage units higher than the true 
value, and for age, civil status and profession, the sample mean has even less bias than them. 

The indegree-based estimators, SHj„ and VH,„, are generally much less biased for all variables, indicating that the 
indegree is a good approximation of sampling probability for nodes in directed networks. The m*-based estimators, 
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Figure 2: Bias, Standard Deviation, and Root Mean Square Error of RDS estimators on Net2, homophily Ha = 0. Sampling without replacement, 
number of seeds=10, coupons=3, sample size=500. 




Figure 3: Bias, Standard Deviation, and Root Mean Square Error of RDS estimators on Net2, homophily h& = 0.4. Sampling without replacement, 
number of seeds= 10, coupons=3, sample size=500. 



9 




Figure 4: RDS on MSM network. The right panel shows sensitivity analysis of p. "' (brown) and p A "' (blue) with m varying from 0.7 to 1.4, 
plots are horizontally shifted a few points to avoid overlapping. Sampling with replacement, number of seeds= 10, number of coupons=3, sample 
size=1000. 



S H„ f and VH m -, have a similar performance on the biasedness, with slightly smaller SD. 

Lastly, when we look at the -based estimators, SS out and SS they are biased for all the four variables, however, 
the (substantially) smaller SD makes their error lie within an acceptable range compared to SH out and VH, mt . 

In Figure 5, we can see that the results from simulations on the modified MSM network, Net3 with indegree 
correlation y = 0.4, are very similar to the results from the unmodified MSM network. The indegree correlation gives 
the indegree-based estimators, SHj„ and VH m , together with the m*-based estimators, SH m * and VH m >, a slightly 
increased bias, however, the overall performance of these estimators are better than SH 0Ut and VH oul . 

The results from the two S S estimators are practically unchanged from the MSM network, which indicates that 
these estimators are very robust to changes in indegree correlation. Again we find (substantially) smaller SD over 
all variables generated by these estimators, which makes the estimated error of SS oul and SS i„ comparatively small 
despite that they are biased on directed networks. 

4.2. Sensitivity analysis 

We perform sensitvity analysis on SH„, and VH m with respect to the attractivity ratio m. The results from the 
generated networks can be seen in Figure 6. 

Figure 6(a) shows how the RMSE of VH m changes with directedness and attractivity ratio when different values 
of m are given to the estimator as simulations are performed on Netl. It is clear that proper prior information on m 
will give small error in the estimator; note also that changes in directedness does not affect estimator performance. 

Figures 6(b) and 6(c) shows the RMSE of VH m and SH m from simulations on Net2 with homophily h — and 
h = 0.4 respectively, and it can be seen that while the estimators have similar performance when homophily is low, 
VH m generate less RMSE when m is far away from m* and homophily is high, implying that when m* is not known, 
VH m may be a better option than SH m in real practice. 
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Figure 5: RDS on Net3 with indegree correlation y = 0.4. The right panel shows sensitivity analysis of p A "' (brown) and p A "' (blue) with m 
varying from 0.7 to 1.4, plots are horizontally shifted a few points to avoid overlapping. Sampling without replacement, number of seeds=10, 
number of coupons=3, sample size=1000. 




V hi S hi S hi V hi 

Figure 6: Sensitivity analysis of p A "' and p A "' on Netl and Net2 with tested m values, (a) Netl , p A "' not shown as it is similar to p A "' ; (b) Net2 
with homophily = 0; (c) Net2 with homophily = 0.4. Sampling without replacement, number of seeds=10, coupons=3, sample size=500. 
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In the sensitivity analysis on the MSM network and Net3, which can be seen on the right side of Figures 4 and 
5, there is more variability in the estimates from age and county than from profession and civil status, as would be 
expected from the previous results. We see that the change in VH m is smaller than in SH m as m is varied; this is 
however negligible for profession and civil status. Generally, we see that we will cover the true value well by using 
the average estimates from the sensitivity analysis, which is especially interesting for county, the only property of 
which m* significantly differs from 1 . 

Overall, we can see that VH m performs better than S H m , making it the preferred choice for RDS in real practice. 
We also did simulations on the above networks with 6 seeds and 2 coupons; however, no substantial differences are 
found in the results on the performance of estimators nor for the sensitivity analysis, see supplementary material. 

4.3. Confidence interval and implementation 

An problem associated with the use of VH,„, is to construct a confidence interval around p™'" when m* - m. 
Traditionally, the standard error of RDS estimates are generated by a bootstrap procedure [31], in which replicated 
samples are drawn based on the recruitment property of original RDS samples. We modify the traditional bootstrap 
method by letting p™ 1 " substitute the traditional RDS estimator when each bootstrapped sample is produced, and then 
let the middle 90% (95%) of the ordered estimates from the bootstrap samples' estimates be the approximation of the 
confidence interval. 

We test the above procedure on Netl and Net2; for each simulation setting ([X,m*, /za]), we take 1000 RDS 
samples and for each of these 1000 samples we construct 90% and 95% confidence intervals based on 1000 replicate 
samples drawn by the above bootstrap procedure. The proportion of times that the generated confidence interval 
contains the true population value p* A , denoted as cD 90 and c£> 95 , are compared with the coverage rates of the traditional 
RDS estimator based method and are presented in Figure 7. 

Apparently, due to the large bias of VH oul when network directedness and attractivity ratio is high, the traditional 
bootstrap procedure performs quite poorly with respect to cD 90 and cD 95 . The attractivity ratio has substantial impact; 
when m* = 1, the coverage rate is generally close to the desired confidence level; however, the coverage rate drops 
rapidly as m* deviates away from 1. For example, when m* = 0.8, the coverage rate of the 90% confidence interval is 
only 29% even when the network (Netl) is undirected. 

On the other hand, the V//,„-based bootstrap procedure gives reliable and consistent confidence intervals over all 
network settings. With the exception of some extreme cases, i.e., when m* < 0.9, A > 0.6, and Via = 0.4, the coverage 
rate is fairly close to the desired confidence level and are overall better than that of the VH ou , -based procedure. 

It is of interest to use the previously suggested sensitivity analysis together with the given bootstrap procedure. 
As an illustration on how to implement the proposed methods in real RDS practice, when indegree information is not 
collected, we take data given in [32] and perform sensitivity analysis with VH m , providing confidence intervals for all 
values of m. A sample of 618 drug users in New York City, and their personal characteristics, were collected using 
RDS with eight seeds. By using our methods on this data, we produce estimates and 90% confidence intervals on the 
proportion of males and the proportion of injectors. 

It is not obvious which values of m that should be used in the sensitvity analysis. One suggestion is to let m vary 
around the observed activity ratio w* , since the indegree-outdegree correlation is positive in most social networks [16, 
17, 18]. The activity ratio (vv*, weighted) for males is 0.99, indicating that there is little difference of the size of 
personal networks with respect to gender. However, the activity ratio for injectors is 1.58, indicating that injecting 
drug users know substantially many more drug users than those who don't inject drugs. The length of the interval of 
m is arbitrarily set to 1 . 

In Fig. 8, we can see that when m = w* , the VH m estimates are equal to those given by VH out . When the network 
is assumed directed and m e [0.5, 1.5], the estimated proportion of male drug users will vary from 0.88 to 0.66. The 
proportion of injecting drug users, varies from 0.45 to 0.62 when m e [1,2], The m intervals used here are arbitrarily 
chosen and their precision thus unknown, and therefore, it is hard to draw major conclusions from this example. 
However, the above analysis conveys another important information: for each change of 0. 1 in the average indegree 
ratio, the change in the RDS estimates will be about 2 percentage units, which also may be an indication of how 
sensitive the RDS estimates are to uncertainties in the collected degree data. 
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Figure 7: 90% and 95% bootstrap coverage probability of p A ""' and p A "'* (shown in brackets) on Netl and Net2. Sampling without replacement, 
number of seeds=l(), coupons=3, sample size=500. 



5. Conclusion and Discussion 

Despite the widely acknowledged evidence of the existence of directedness among social networks, the effect of 
directedness on RDS estimates has seldom been evaluated. This could be problematic since all previously reported 
RDS estimates rely on the assumption that the studied networks are purely reciprocal, the violation of which will 
result in unknown bias. To address this situation, we have extended previous RDS estimators onto directed networks 
and evaluated their performance on networks with various structural properties. 

Our study shows that the individual indegree is a fair approximation to nodes' sampling probabilities in a RDS 
process on a directed network, and that this approximation is robust to changes in indegree-outdegree correlation 
and indegree correlation etc. The suggested indegree-based estimators SH„, and VH m can also be used with prior 
information on nf, the ratio of harmonic mean indegrees of the two groups in the sample; an approach that gives 
good results. Prior information about m* may possibly be obtained by expert opinions, or by using previous empirical 
studies related to the studied population; the need for prior information however limits this application in practice. 

To further address the reality that the indegrees are not observed from an RDS sample, we have developed a sensi- 
tivity analysis method, based on the attractivity ratio m* between the studied groups in the population, to incorporate 
the uncertainties in both network directedness and reported outdegrees. Our results show that, while it is of course best 
to have correct prior information on the network, it is possible to get a deeper understanding of how RDS estimation 
is influenced by network directedness by using sensitivity analysis. E.g., in our results from the MSM network it is 
shown that erroneous prior information may give serious errors. 
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It is difficult to find an obvious setup for the sensitivity analysis both with respect to guessing a value of m* and 
deciding to which extent it should be varied. However, since many social networks have positive indegree-outdegree 
correlation, the activity ratio w* , which is observed from the sample, may be an indicator of where to vary m from, 
and often the difference between m* and w* is small. For example, in the MSM network, the absolute difference is 
0.27, 0.17, 0.02, and 0.05 for age, county, civil status and profession, respectively [23]. 

From the results of sensitivity analysis on Netl and Net2, we can see that the performance of VH m and SH m is 
determined primarily by the attractivity ratio m*, rather than network directedness A. Thus, if the network instead 
is assumed undirected, in which the ratio of indegrees is equal to the ratio of outdegrees (m* - w*), the sensitivity 
analysis may instead be used to assess the uncertainty of reported (out)degrees. The differential function of VH m over 

H "A 

m, '-^f- \m=«* — ( -a" Y \m=w- - - „/'". 2 7 th en provides the magnitude of how much the RDS estimate would change 

if there is any reporting error in the degree information. 

Acting as a natural companion to the sensitivity analysis or estimation with prior information, our modification of 
the previous RDS bootstrap method has proven to work well on directed networks. Based on the VH,„ estimator, it 
largely outperforms the traditional V// ou ,-based method, and can construct close-to-expected confidence intervals for 
networks with varying directedness and attractivity ratio. 

Another finding, which has not been highlighted in previous research, concerns the SS estimator [15]. This 
estimator has small and overall consistent standard error among the networks tested in our paper. Given that the 
population size is known, this estimator is expected to produce RDS estimates with acceptable bias and error. 

While it is in the interest of this study to do a full evaluation of RDS on directed networks, it is worth noting that 
since RDS utilizes a peer-driven mechanism, and the recruitment rights are few and valuable, respondents are usually 
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inclined to recruit those that they know reasonably well. Such a mechanism to a large extent avoids the occurrence of 
recruitment via directed edges, and in most RDS studies, the proportion of recruitment through strangers are relatively 
small, usually less than 10% [33, 22, 21, 19, 20]; as pointed out earlier, this proportion may in some cases be larger 
though. In our results, we see that already small proportions of directed edges affects previous estimators, so while 
networks with extremely high directedness is very unlikely to occur in reality, and primarily are included out of 
theoretical interests, our evaluation shows that estimators will be sensitive to directedness in reality and that it is 
therefore an important issue to address. 

For actual RDS practice, network directedness has previously not been an highlighted issue. The suggested sen- 
sitivity analysis gives RDS practitioners the possibility to take directedness into account as it provides means to un- 
derstand the robustness of sample inference to the violation of certain assumptions: that the network may be partially 
directed, and that the degree data collected from respondents may contain reporting error. As there are no methods 
available on how to quantify network directedness, an interval of estimates based on a range of m values is currently 
the best way of understanding this issue; additionally, it may give researchers a more detailed image of the situation 
and advice on how to understand the studied population. 

We hope that the current study can inspire research beyond the purpose of studying hidden populations, such as 
sampling the contents of webpages, where indegree may be used as a cheap and efficient parameter to approximate 
the inclusion probability of random walks on internet [25, 34, 35, 36]. Other factors that might affect inclusion 
probabilities, such as transitivity, degree distribution, closeness, etc., yet need to be investigated in future studies. 
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Appendix A: Generation process of Netl, Net2 and Net3 

A.l. Netl 

Netl is the set of networks with different levels of directedness, in which the indegree and outdegree are not 
correlated (p m 0). 

Step 1 Base network. At first, a random purely directed network (A = 1) is generated by randomly distributing 
ND irreciprocal/directed edges between N nodes with the restriction that no reciprocal edges should be formed. 

Step 2 Varying directedness. In order to decrease the directedness to a given A e [0.2, 1], irreciprocal edges 
in the base network are randomly chosen and rewired to form reciprocal edges. Specifically, at each step, an edge 
i — > j between nodes v, and Vj is randomly chosen; then, if there is no link pointing from j to i, we randomly find 
an irreciprocal incoming edge of i, k — > i and an irreciprocal outgoing edge of j, j — > / (Figure 9(a)). These edges 
are then rewired as k — > /, and j — > i (Figure 9(b)), such that a new reciprocal pair of edges i <-> j is formed, and 
the degrees of i, j, k and I remain unchanged. The rewiring process is restarted from the beginning if the network is 
disconnected. 




Figure 9: Netl: Illustration of the rewiring process leading to a decrease in A 



15 



Step 3 Varying attractivity ratio. Let m* be the desired attractivity ratio. For each network generated in Step 2, 
NP* (0 < P* < 1) nodes are randomly picked and assigned property A, and the remaining nodes are assigned property 

B. Then, the attractivity ratio of the network, m' = where d' A , d'jf are the average indegrees of nodes with property 
A and B respectively, is calculated. If m' + m* , the following algorithm is carried out in order to generate a network 
with the required m* value: 

(i) Randomly pick two nodes, v, and vj, with different properties; 

(ii) If m' > m* and d A > dg, exchange the property of v, and v/, 

(iii) Else if m' < m* and d A < dg, exchange the property of v, and vf, 

(iv) Repeat (i)-(iii) until m' = m* . 

Step 4 A random undirected network of the same size and average degree is generated separately for A — 0, and 
the method described in Step 3 is used to generate different m* values in this network. 

A.2. Net2 

Net2 is the set of networks with a certain amount of indegree-outdegree correlation and different levels of direct- 
edness and homophily. 

Step 1 Base network. At first, a random purely undirected network (A - 0) is generated by randomly distributing 
ND/2 reciprocal/undirected pairs of edges between N nodes. 

Step 2 Varying directedness. In order to increase the directedness to a given A e [0, 1], reciprocal edges in the base 
network are randomly chosen and rewired to form irreciprocal edges. Specifically, in each step, a pair of reciprocal 
edges i <-> j is randomly chosen. If there is no link between two randomly chosen nodes k and / (Figure 10(a)), then 
we randomly pick one of the two links between i and j and use it to connect k and I (Figure 10(b)). Such a process 
leads to an indegree-outdegree correlation pa I - A. The rewiring process is restarted from the beginning if the 
network is disconnected. 




(b) 

Figure 10: Net2: Illustration of the rewire process carried out to increase A 

Step 3 Varying attractivity ratio. The same process as described in Step 3 in Section 5 is used to generate different 
m* values for each network generated in Step 2. 

Step 4 Homophily. In order to generate networks with different homophily for group A, Iia, links are further 
rewired in each network generated in Step 3, Let h' A be the homophily of group A in the current network. At each 
step, either a pair of irreciprocal links or reciprocal links are randomly picked (Figure \\);\f h' A > Iia, meaning that 
there are too many within-group connections, we rewire the within group links, i —* k, I — > j (or ; «-» k, I <-> j), to 
i — > j, k — > / (or i <-> j, k «-> 0, or vice versa if h' A <Ha- The above process is repeated until h' A = h^. 

A3. Net3 

Net3 is the set of networks with different levels of indegree correlation (y e [0, 0.4]), varied from the MSM 
network; all four node properties (age, county, civil status and profession) are kept. 

Stepl Base network. The MSM friendship network obtained from the web community [30, 23]. 

Step2 Varying y. A shuffling method slightly different from what was described in [37] is used to generate 
networks with different indegree correlation. At each step, we randomly pick a pair of edges, i — > j, k — > I (Figure 12). 
If the indegrees of i and I are the two largest or the two smallest among the four nodes, and j and / have the same 
property, we rewire the two edges as i — * I, k — > j. Then, the degree distribution and homophily of the network is 
kept, and the indegree correlation increases as the rewiring process progresses. We generate networks with y up to 0.4 
for each of the four properties in the MSM network. 
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Figure 1 1 : Net2: Illustration of the rewire process resulting in a change of Ha . Red: trait A, Blue: trait B 



V, 




(a) (b) 
Figure 12: Net3: Illustration of the shuffling process used for network generation. Vj and V/ have the same property. 

Appendix B: Discussions on the estimate of S x y in SH in 

It can be proven that when the network is undirected, a node will be recruited into a RDS sample with a probability 
proportional to its degree if the assumptions for SH ou , are fulfilled [10, 9]: {w, ~ drf dj}. Consequently, each edge 
in the network, {e,_>j}, has a probability [jTi^j - nj/dj ~ 1/2/ dj} to be sampled, and the observed recruitment matrix 
from the RDS sample is an unbiased estimate of S . 

However, when the network is directed, the inclusion probability for a node is no longer proportional to its degree, 
and the observed recruitment matrix from the sample will be representative only if individuals of the same group have 
similar edge formations, i.e., the personal recruitment matrix, is the same for individuals in group X. Then, the 

observed raw recruitment matrix could be an appropriate estimate of Sxy- 

A more general way is to develop a Hansen-Hurwitz type estimator for Sxy using the edges' inclusion probabilities 

Sxy = d J; , (18) 

where X and Y are the set of nodes with corresponding properties in the sample. Since {7r,-} is usually not known 
when knowledge about the structure of the network is incomplete, we might use the mean field approach to approxi- 
mate {iTi} with the average inclusion probability for nodes within group X: 

Zi^jJeXJeY 77 Xi^jJeXJeY d"/" 
*XY = ^T" = ~^r- (19) 

Frankly, both using the observed recruitment matrix from the sample, and approximating as described by (19), are 
brutal methods for estimating Sxy- We have tried both in the SH,„ estimator; however, it turns out that the adjustment 
for s made by (19) always generates larger error and bias; we thus only provide the discussions here and choose not 
to show any results in the paper. 



17 



Appendix C: Supporting figures 




Figure 13: Bias, Standard Deviation, and Root Mean Square Error of RDS estimators on Netl . Sampling without replacement, number of seeds=6, 
coupons=2, sample size=500. 




Figure 14: Bias, Standard Deviation, and Root Mean Square Error of RDS estimators on Net2, homophily h\ = 0. Sampling without replacement, 
number of seeds=6, coupons=2, sample size=500. 



18 




Figure 15: Bias, Standard Deviation, and Root Mean Square Error of RDS estimators on Net2, homophily h& = 0.4. Sampling without replacement, 
number of seeds=6, coupons=2, sample size=500. 




Figure 16: RDS on MSM network. The right panel shows sensitivity analysis of p A "' (brown) and p A "' (blue) with m varying from 0.7 to 1.4, 
plots are horizontally shifted a few points to avoid overlapping. Sampling with replacement, number of seeds=6, number of coupons=2, sample 
size=1000. 
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Figure 17: RDS on Net3 with indegree correlation y = 0.4. The right panel shows sensitivity analysis of p A "' (brown) and p A "' (blue) with 
m varying from 0.7 to 1.4, plots are horizontally shifted a few points to avoid overlapping. Sampling without replacement, number of seeds=6, 
number of coupons=2, sample size=100(). 
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Figure 18: 90% and 95% bootstrap coverage probability of p A m " and p A "'" (shown in brackets) on Netl and Net2. Sampling without replacement, 
number of seeds=6, coupons=2, sample size=500. 
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